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Abstract: Occupancy models are typically used to determine the prob¬ 
ability of a species being present at a given site while accounting for im¬ 
perfect detection. The survey data underlying these models often include 
information on several predictors that could potentially characterize habi¬ 
tat suitability and species detectability. Because these variables might not 
all be relevant, model selection techniques are necessary in this context. In 
practice, model selection is performed using the Akaike Information Cri¬ 
terion (AIC), as few other alternatives are available. This paper builds 
an objective Bayesian variable selection framework for occupancy models 
through the intrinsic prior methodology. The procedure incorporates pri¬ 
ors on the model space that account for test multiplicity and respect the 
polynomial hierarchy of the predictors when higher-order terms are consid¬ 
ered. The methodology is implemented using a stochastic search algorithm 
that is able to thoroughly explore large spaces of occupancy models. The 
proposed strategy is entirely automatic and provides control of false posi¬ 
tives without sacrificing the discovery of truly meaningful covariates. The 
performance of the method is evaluated and compared to AIC through 
a simulation study. The method is illustrated on two datasets previously 
studied in the literature. 

Keywords and phrases: Imperfect detection, intrinsic priors, model pri¬ 
ors, strong heredity, Bayesian variable selection, AIC. 


1. Introduction 

It is often the case that measurements recorded for a given response are, at 
best, a noisy version of the variable of interest. A particular case of this issue 
is known as imperfect detection, and constitutes a pervasive problem. For in¬ 
stance, in biological surveys subject to imperfect detection, “presence/absence” 
data for a given species actually become “detection/non-detection” data, since 
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a specie may be present at a given site, but may go undetected in a survey. 
Ignoring imperfect detection may lead to inaccurate measurement of the pres¬ 
ences (Guillera-Arroita et ah, 2014), which generally results in biased parameter 
estimates (MacKenzie et ah, 2002). 

As defined in the ecological literature, occupancy is the proportion of sites 
where a target species is present, constituting a state variable instrumental to 
assess the distribution of species (MacKenzie et ah, 2002). Over the past decade, 
site occupancy models have been the main tool used by ecologists to estimate 
occupancy while accounting for imperfect detection. Occupancy models describe 
the observed data by linking two processes: presence and detection. Occupancy 
models adapt the conventional binary regression model to produce separate es¬ 
timates for presence and detection probabilities (Dorazio and Taylor-Rodriguez, 
2012). This separation is possible by surveying repeatedly the sampling loca¬ 
tions, which provides additional information to better assess if non-detection of 
the specie truly corresponds to its absence. Conveniently, these models can be 
fitted even if the number of surveys is unbalanced across sites. The core of the 
occupancy model is characterized by the hierarchy 

yij\zi Bern(2:ipjj) 

Zi ~ Bern(V'j), (1.1) 

where yij is the binary detection indicator at the ith site (f = 1,..., N) during 
the jth survey (j = The detection probability for event {yij — 1} 

is pij whenever the species is present; and Zi is the presence indicator at the 
ith site with success probability Note that the Zi are imperfectly observed. 
At at site z, whenever the vector of detections ^ 0 , then we know that 
Zi = 1, but yi = 0 does not imply that Zi = 0. To produce estimates of tpi 
and Pij, site occupancy surveys collect information on several predictors with 
the potential to influence habitat suitability (characterizing ipi) and species 
detectability (defining Pij). Given that some of the collected predictors may be 
uninformative or redundant, variable selection techniques are instrumental in 
identifying good models. 

In this paper, we propose an objective Bayesian variable selection procedure 
for occupancy models. Our approach is based on intrinsic objective priors for 
the model parameters, and uses priors over the model space that simultaneously 
account for test multiplicity, and, if interactions and/or polynomial terms are 
considered, enforces the polynomial hierarchical structure among predictors. 

Gurrently, variable selection procedures for occupancy models implemented 
in statistical software are mainly based on the Akaike Information Criterion 
(AIC) (Akaike, 1983). As a consequence, these procedures do not allow for valid 
post-selection inference and uncertainty quantification, and typically require 
enumerating and fitting every possible model in the space of models under con¬ 
sideration (e.g., Mazerolle and Mazerolle, 2013; Fiske and Chandler, 2011). In 
practice, this enumeration is feasible only if the model space is small enough, 
either because substantial knowledge about the underlying ecological processes 
is available to constrain the model space, or because only a few variables are 
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considered to begin with. Nevertheless, many site occupancy surveys collect 
large amounts of covariate information about the sampled sites, and since the 
total number of candidate models grows exponentially in the number of pre¬ 
dictors, choosing a reduced set of models based on ecological intuition becomes 
increasingly difficult. 

The AIC is designed to find the model that is the closest to the true (un¬ 
known) model with respect to Kullback-Leibler divergence, identifying as good 
models those with smaller AIC values. It has been shown, however, that the 
AIC has certain limitations as a model selection criterion. For instance, if nested 
models are being considered, the AIC will not necessarily select the true model 
(Wasserman, 2000). In fact, the AIC generally shows a weak signal-to-noise ra¬ 
tio and tends to prefer more complex models, even if the true model is available 
(Rao and Wu, 2001). Other versions of the AIC address this issue by including 
a bias correction factor that enhances the signal-to-noise ratio (see Hurvich and 
Tsai, 1989; McQuarrie et al., 1997); however, these modified versions cannot be 
used with occupancy models, as they depend on the effective sample size, which 
is unknown for these models. 

In this context, Bayesian methods are more appealing. Under regularity con¬ 
ditions, when the true model is contained in a fixed model space, its posterior 
probability converges to one as the number of sites and surveys per site both 
increase. In addition, if the true model is not contained in the model space, the 
posterior probability of the most parsimonious model closest to the true data 
generating process tends asymptotically to one. In the finite sample context, 
Bayesian methods allow for full and faithful error propagation. Furthermore, 
the Bayesian machinery provides the means to conduct valid inference account¬ 
ing for model uncertainty. 

A Bayesian selection procedure for occupancy models was described in Hooten 
and Hobbs (2015). However, their implementation uses informative prior distri¬ 
butions on the model parameters, tailored specifically to the example discussed 
in the paper, which prevents the approach from being applicable to occupancy 
problems in general. It is often the case that subjective elicitation of parame¬ 
ter and model prior distributions is not possible, since neither the relationship 
between the response and the predictors, nor the advantages of one model over 
another, are clearly understood. In addition, the use of seemingly innocuous 
subjective priors may drastically affect outcomes. This has been a recurring 
argument in favor of objective Bayesian procedures, which appeal to the use 
of formal rules to build parameter priors that incorporate the structural infor¬ 
mation inside the likelihood while utilizing some objective criterion (Kass and 
Wasserman, 1996). 

To the best of our knowledge, the method proposed in this article is the first 
general Bayesian selection procedure for occupancy models, that 

1. bypasses the need for hyper-parameter tuning, 

2. uses priors specifically designed for testing, 

3. controls for test multiplicity, and 

4. accounts for the hierarchical polynomial structure in the predictors. 
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In building our approach, we first derive intrinsic priors (Berger and Peric- 
chi, 1996; Moreno et al., 1998) for the model parameters in both the presence 
and detection components of the single-season occupancy model. For the model 
priors, we consider the ones proposed in Taylor-Rodriguez et al. (2015). These 
priors, in addition to controlling for test multiplicity, allow restricting the model 
space to the set of models that respect (weakly or strongly) the polynomial hi¬ 
erarchy among the predictors whenever interactions and higher-order terms are 
considered. As discussed in Peixoto (1987, 1990) when covariate interactions 
and polynomial terms are present, failure to restrict the class of models to those 
respecting strong heredity may result in incoherent variable selection. This is 
because the model design matrices are not invariant to linear transformations 
of order-one predictors (e.g., recentering of the main effect variables). Using 
the derived intrinsic priors on the parameter space and the multiplicity correc¬ 
tion priors on the model space, we build a fast stochastic search algorithm that 
allows us to thoroughly explore large spaces for the single-season occupancy 
model framework. This strategy is completely automatic, avoiding the need for 
both tuning parameters in the sampling algorithm and subjective elicitation of 
parameter prior distributions. Furthermore, as any other Bayesian approach, it 
naturally enables parameter and model uncertainty quantification. 

The outline of the paper is as follows: in Section 2, we provide background 
on occupancy models and set notation. In Section 3, we introduce our objective 
Bayesian model selection method and develop the Gibbs sampler. In Section 
4, we present results from a simulation study and a comparison with selection 
using AIC. In Section 5.2, we illustrate our methodology on two datasets, which 
have been previously examined in the ecological literature (Kery et al., 2005; 
Kery et al., 2010; Dorazio and Taylor-Rodriguez, 2012). We conclude the paper 
with a brief discussion. Code for all the tools proposed is available in the R 
package OccO Bayes. A description of the stochastic search algorithm is included 
in the Appendix. 

2. Inference for a single model 

This section briefly describes the estimation procedure for a single model. As¬ 
suming the probit link, the occupancy model can be characterized in terms of 
latent variables, which in turn allows one to relate the detection and occupancy 
probabilities to predictors. We build an objective prior distribution for the re¬ 
gression coefficients using the expected posterior prior framework (Perez and 
Berger, 2002) where we condition on both the observed data as well as the 
unobserved latent variables (Leon-Novelo et al., 2012). 

2.1. The Occupancy Model with Probit Link 

The occupancy model in (I.l) is completed in two ways. First, the probabilities 
for detection pij and for presence tpi are linked to vectors of predictors q^- 
and Xi, respectively, through appropriate link functions, gp{pij) = and 
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9')p(/Pi) = We assume that the link function is the inverse standard normal 
cdf, leading to probit models. Other binary regression models can be fit and lead 
to slightly more complicated computational algorithms. Second, the parameters 
of the underlying space, here (ct. A), are given a prior distribution 7r(a, A). This 
paper proposes a prior distribution building on the expected posterior prior 
method of Leon-Novelo et al. (2012). 

Letting X and Q be the matrices whose rows are, respectively, vectors x( and 
qL for i = 1,... ,N and j = 1,..., Jj, the Bayesian probit occupancy model is 
specified as 

yij\zi,a,X,Q,X ~ Bein{ziP,j) with = $ (q^A) 

Zi|Q;,A, Q,X ~ Bern('(/)i) with = <& (x'a) 

ct, A|Q, X ~ TT, (2.1) 

where <i> is the standard normal cdf. As it will be made evident in subsequent, 
we explicitly condition on X and Q since the priors devised for the model 
parameters depend on these design matrices. Again, note that the Zi are not 
perfectly observed. The sites with = 0 provide no detections but this does 
not necessarily imply a lack of presence. Thus, the model is a zero-inflated 
binary regression model where both lack of presence and individual instances of 
detection are predicted with covariates. The observed data vectors for the sites, 
yi,..., y„, are independent given (a. A) and 


‘-IviltO} 


p(y,|a,A,Q,X)= (x'a) ^ $ (qLA)"- (l - $ (qhA))' 

X I $ (x'a) n (1 - ® + (1 - ^ (XiO!)) 


‘■{yi=o} 


The model can be expanded in the spirit of Albert and Chib (1993) by intro¬ 
ducing latent variables at each level. Let Vi be the underlying continuous latent 
variable for presence at site i and Wij be the underlying continuous latent vari¬ 
able for detection during survey j from site i. The hierarchical model in (2.1) 
becomes 


Uij \ ^i^ J ^ij : ^5 Q j ^ 

Wij\zi,Vi,a,\,Q,X 

2:i|ui,a, A, Q,X 

Vi\a,\,Q,X 
a, A|Q, X 


fV(qCA,l) 

^{vi>0} 

N (x'a, 1) 

TT. 


( 2 . 2 ) 


When one uses a multivariate normal prior for (a. A), the model in (2.2) can 
be fit using a Gibbs sampler. As described in Dorazio and Taylor-Rodriguez 
(2012), the only complication in using a Gibbs sampler is the fact that the sign 
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of Vi determines the value of Zi and so the Gibbs sampler has to proceed in 
two blocks. The first block, which corresponds to a multivariate normal draw, 
is (ct, A|z, V, w, y, Q, X). The second block is (v, w, z|q:, A,y,Q,X). Each Zi is 
drawn from the distribution [zi\a, A,y,Q,X], which is a Bernoulli distribution 
with probability of success 





and the Vi and Wij are sampled independently from their full conditionals. Each 
Vi has a truncated normal distribution with mean x'a and variance 1, restricted 
to the positive real line when Zj = 1 and to the negative real line when Zi = 0. 
Each Wij has a truncated normal distribution with mean q.^jA and variance 1 
that is supported on the positive real line when ZiUij = 1, the negative real line 
when Zi{l — i/ij) = 1, and the whole real line when Zi = 0. 

The marginal p(y|X, Q) for the observed data can be estimated using the 
output from the Gibbs sampler (Chib, 1995). In this sampling scheme, one 
can also perform parameter expansions for both v and w (Liu and Wu, 1999). 
These dramatically decrease the autocorrelation between successive samples and 
reduces the asymptotic variance of estimators (Roy and Hobert, 2007). 

Alternatively, one can perform inference for the model specified in (2.1) di¬ 
rectly using a Metropolis-Hastings algorithm (e.g., an independence chain, a 
random walk, or Hamiltonian Monte Carlo). The output of the Metropolis Hast¬ 
ings algorithm can be used to estimate the marginal of the observed data using 
the method outlined in Chib and Jeliazkov (2001). When the sample size is 
large, an independence chain, using the Laplace approximation to the posterior 
as a proposal density, provides accurate numerical estimates of the posterior 
evaluated at its mode in a relatively small number of samples. 

2.2. An Objective Prior for (a, A) 

Intrinsic priors, as defined by Moreno et al. (1998), are an example of expected 
posterior priors (Perez and Berger, 2002). Concisely, an expected posterior prior 
for parameter 6 with prior ttm under a model M is given by 


T^M{0\TiM,rnQ) = I pM{0\D,'KM)mo{D)dD, 


where D is some imaginary data that is integrated out, pm((^\D,t:m) is the 
posterior of 6 given data D under the model M with parameter prior ttm, and 
Too is a hxed distribution for generating the data D. The properties of the data 
D are determined by the investigator. For regression problems, this amounts to 
determining the number of samples in the response and the associated design 
matrix. The generating model toq for the data D is usually taken to be a simple 
model, for instance an intercept-only model. Thus, the expected posterior prior 
under model M is calibrated to the distribution toq. 
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Consider the context of multiple models, Mq, Mi,...,, where Mq is 
nested in Mp for all k and model Mp has parameter 9k with non-informative 
(often improper) prior . In this context, Mq is referred to as the base model. 
The intrinsic prior for each model is computed as 

,m^) = J pZ^{9k\Dk,TTk)m^{Dk)dDk, 

where Dk is a training sample for model Mp and rriQ is the marginal density for 
Dk under the base model. For the intrinsic prior. Dp is taken to be a minimal 
training sample for model Mk under the prior , which is a dataset of the 
smallest possible size that provides a proper posterior for p^^{Gk\Dk,n^). Of 
course, the intrinsic prior for the base model is just its original non-informative 
prior. When the prior for model Mp is improper and only defined up to a mul¬ 
tiplicative constant Ck, the intrinsic prior framework removes the ambiguity of 
these constants and each intrinsic prior is defined up to a common multiplicative 
constant cq. 

An extension of the intrinsic prior framework is to have the datasets Dk 
include both observable and unobservable latent variables. Leon-Novelo et al. 
(2012) used this approach in computing an objective prior for standard probit 
regression. There, the authors conditioned on both the observed binary data as 
well as the unobserved continuous latent variables. Following their development, 
we form an objective prior for the occupancy model by conditioning on the 
unobserved latent presence variables (z) as well as the unobserved continuous 
latent variables for both presence and detection (v, w). We refer to this objective 
prior as an intrinsic, prior though its derivation differs from that in Moreno et al. 
(1998) and Berger and Pericchi (1996). 

Specifically, let Xq and Qo be design matrices for presence and detection in 
the model Mg and let X and Q be design matrices for a model M that nests Mq. 
Let (a. A) and (ao) -^o) be the parameters of M and Mq, respectively. Further, 
assume that the prior distributions for the parameters under each model are 
constant, = cm and ttq = cq. The intrinsic prior for (a. A) is given by 


JPi 

‘M 




(a, A|Q,X) = X! // PM(«>'^|z,v,w,y,Q,X)m^(z, v, w, y|Qo, Xo)dv dw, 


(2.3) 

where the over variables indicates that these correspond to the training 
sample that is to be integrated out. The formula in (2.3) is greatly simplified 
by the fact that, under the non-informative prior, (a. A) are conditionally in¬ 
dependent of (z,y) given the continuous latents (v,w). Moreover, the cx and A 
are conditionally independent of each other given the continuous latents. Thus, 
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(2.3) simplifies to 

= yy Pm(«|v, w, Q, X)p))^(A|v, w, Q, X)m^(v, w|Qo, Xo)dv dw 

= J PM(a|v,X)m^(v|Xo)dv X J p)^(A|w, Q)m()^(w|Qo)dw, 

(2.4) 

where the last equality follows from the assumptions of (2.2) and the prior 
independence of a and A under Both of the integrals in (2.4) are of the 
form of the integrals in Leon-Novelo et al. (2012). Thus, the intrinsic prior is 
given by a product of singular normal distributions. 

The explication of these priors is greatly aided by the introduction of addi¬ 
tional notation. Because Mq is nested in M, we can write X = (Xq X^) and 
Q = (Qo Qa) and can do the same for the design matrices for the minimal 


training sample. Similarly, we 

can 

write 


a),)' 

and A = 

(A',A( 4 )'. The 

intrinsic prior is given by 








M 

(^0, 2 1 

h- ( 

1 



) (2.5) 

•^aI-^OjQ ~ 

M 

(^0, 2 1 

(Qa (i- 


)m)"' 

) (2.6) 


Co 

X do 




(2.7) 


where Ho^ and Hq,, are the hat matrices associated to Xq and Qo, respectively. 

Here we include two constants cq and do for the reference prior for the base 
model, where Co and do ^re undefined constants for the fiat priors for ccg and 
Ao, respectively. 

The only remaining task for this intrinsic prior is to define the design matrices 
for the minimal training samples. Letting po, = dim(Q:) and px = dim(A), the 
minimal training samples for v and w contain po, and px samples, respectively. 
Following Leon-Novelo et al. (2012) and Casella and Moreno (2006), we define 
X and Q to be any design matrices of dimensions Po, x pa and px x px satisfying 

X'X = ^X'X and Q'Q = ^Q'Q, (2.8) 

where N is the number of sites and J, = total number of sur¬ 

veys. Note that the covariance matrices in (2.5) and (2.6) are thus completely 
determined by X'X and Q'Q. 

3. The Variable Selection Problem 

The hierarchy in Equation (2.2) is given for a specific model with a fixed set 
of predictors. This section develops the model selection problem for occupancy 
models. Each model contains two components, one for presence and one for 
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detection. Thus, model M is decomposed as M = (My^M^), where My is a 
component model for detection and is a component model for presence. 
The base model is Mq = {M ^^, ), where the component base model design 

matrices contain at least a column of ones for the intercept. Each model M is 
assumed to nest Mq and the prior for model M is taken to be the intrinsic prior 
defined in (2.5)-(2.7). The largest model is denoted by Mp = [^Mp^,Mp^') and 
contains the largest possible component models for detection and presence. The 
design matrices for these full components are and Qi?. 

Let K = {Ky^Kz), where Ky and denote the sets of column indices for 
Qi? and li-p that are not in Qo and Xq, respectively. The model space can 
then be represented by the Cartesian product V (Ky) x V (K^), where V (B) 
is the powerset of B. A specific model is represented by A = {Ay,Az) with 
Ay C Ky and A^ C K^. Thus, the entire model space M is populated by models 
of the form Ma = {^Ay ■, Ma^ ), where MAy and Ma^ are the corresponding 
component models for detection and presence determined by the base covariates 
as well as covariates with indices in Ay and A^, respectively. It follows that for 
the presence process z, the design matrix for the model is of the form 
Xma = (^0 Xa), where Xq is the design matrix of the base model Mq^ 
and Xyi is the matrix containing the covariates indexed by Az (and similarly 
for Qma = (Qo Qa))- Denote the regression coefficients of the model Ma by 
(A-Ma = ^'a)' ^Ma = ^'a)' for presence and detection, respectively. 

It is important to note that this construction using the Cartesian product 
provides the largest possible model space for the occupancy model given the 
structures of the base and full models. Investigators may wish to impose ad¬ 
ditional model space restrictions based upon their (subjective) judgment. One 
means of achieving this restriction is to form two sets of models, Aiy for de¬ 
tection and M.Z for presence. The model space can then be defined by the 
Cartesian product, M. = Aiy x Miz- One particular example of such a restric¬ 
tion arises when higher-order terms are included in the detection or presence 
models. Heredity conditions (Chipman, 1996) can be imposed on either model 
space and appropriate priors dehned (see Section 3.1). 

3.1. Priors over the Space of Models 

Here we outline the construction of prior distributions over the model space. 
To allow for flexible modeling, it is assumed that the sets of covariates can 
potentially include interaction effects, higher-order polynomial terms, and factor 
variables. The priors for either the presence or the detection component have 
the same structure, and the joint prior is the product of marginal priors of the 
two model components. 

The priors placed on the model space for the presence and detection models 
respect the hierarchy of the terms that could be included in a given model. 
Aspects of the prior construction are described here and full details on such 
priors can be found in Taylor-Rodriguez et al. (2015). The full model for either 
the presence or the detection component is represented as a directed acyclic 
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graph (DAG) with nodes representing polynomial terms (powers or interactions; 
e.g., xi or xl or X 1 X 2 ) and with edges specifying inheritance relationships. For 
example, xix"^ has edges (inherits) from its parent nodes X 1 X 2 and also 
x\ inherits from its parent xi but not from X 2 - Feasible models, also known 
as models obeying weak heredity, correspond to a special kind of connected 
subgraph of the full model DAG. First, they must include the base model DAG. 
Second, a node ry can only be included in a model’s subgraph only if there 
is a directed path from a node in the base model to rj. The priors considered 
here focus on models satisfying a strong heredity (also known as well-formulated 
models), which amounts to requiring that for each node ry in a model’s subgraph, 
all parents of ij included in the model’s subgraph. 

Model prior probabilities are specified recursively via conditional node in¬ 
clusion probabilities (given the parent DAG) using a type of Markov condition 
reflected in the principles of conditional independence and immediate inheri¬ 
tance (Ghipman, 1996). Conditional node inclusion is identified with a latent 
Bernoulli random variable and placing a conjugate (beta) prior on the inclusion 
probabilities. The model space prior is obtained by integrating out the condi¬ 
tional inclusion probabilities. In the simplest case all of conditional inclusion 
probabilities are assumed to be equal and the prior is called the hierarchical 
uniform prior (HUP). The amount of penalization of complex models can be 
adjusted (typically, increased relative to the purely combinatorial penalization 
of the HUP) using node-specific inclusion probabilities and stronger shrinkage 
via the beta hyper-priors on the inclusion probabilities; this results in the hierar¬ 
chical independence (HIP) and order priors (HOP) that group nodes of similar 
complexity together. 

3.2. Model Posterior Probabilities 

In order to compute the posterior probabilities of interest, we take advantage 
of the model representation making use of the latent variables introduced for 
the presence and detection processes. Specifically, a conditional independence 
argument provides 

w(y,z,w,v|MA)7r(MA) 
w(y,z,w,v) 

/y,z(y, z|w, v)m(w, v\Ma) TrjMA) 

/y.z(y, z|w, v) Em-gAI v|M*)7r(M*) 

m(w, v|M^)7r(M^) 
m(w, v) 


p(MA|y,z,w,v) = 
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because z is independent of Ma once v is known and y is independent of Ma 
once z and w are known. In (3.1), 


/y,z(y,z|w,v) 

m(w,v|MA) 


N J 

i=i i=l 

w(v|MaJto(w|M^ ) 


N 


N Ji 


i[^{v,\k[cx,i;maj linn (^(wylq^A, 1 ;MaJ 

i=ij=i 




A|Q,X)dQ:dA, 


(3.2) 


with (^(’l/x, cr^; M) denoting the normal pdf with mean /x, variance cr^ conditional 
on model M, and 7r^(a, A|Q,X) as defined in (2.4). 

Under the intrinsic priors above, the closed-form expression for the marginal 
to(v|M^^) is 


to(v|M^) 


Co (27r) 


-("-Po*)/2 


PA, 


2N PA, 


(pa, -PO, ) 


|X(,Xo| 


exp 




2N 


2N pa. 


Hi. 


X 


(3.3) 


where is the hat matrix associated with (I—Ho.)X. 4 . Similarly, the marginal 
distribution for w under model Ma is 


to(w|M^) = do (27r) po„)/2 


PA, 


2.J% -f PAy 


IQoQoi 


exp 




2J. 


2J, + Pa, 


Hi 


(3.4) 


where is the hat matrix associated with (I — Hoj^)Qyi and J, = ^i 
is the total number of surveys. Finally, the marginals for the base model Mg = 
{Moy,Mo,) are 


to(v|Mo) 

= JcoAf{- 


= co(27r) 

and 


m(w Mo) 

= do(27r) 


IXU, 


qAoI "exp 


--(v' (I-HoJv) 


(X,-P0«) 


IQoQol ^ exp 


"2 K (I-Hojw) 


(3.5) 

(3.6) 


The specification of the model posteriors in Equation (3.1) is completed using 
the construction of the priors tt{Ma) over the model space; see Section 3.1. 
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The advantage of (3.1) is that the posterior of model can be represented 
as 

p{MA\y) = JJJ p{MA\y,z,w,v)f{z,w,v\y)dzdwdv, (3.7) 

which provides for straightforward ergodic estimation of p(Myi|y) if samples can 
be drawn from /(z,w,v|y). If S such samples are obtained, then (3.7) can be 
approximated by 

5-i^p(M^|y,zW,wW,vW). (3.8) 

e 

Such draws can be obtained using reversible jump MCMC (RJMCMC) (Green, 
1995), as described in the Appendix. One subtle point of difficulty is the cal¬ 
culation of m(w,v) = m(w, v|MA)7r(MA) in the denominator of (3.1) 

when the space of models is too large to be enumerated (or if the necessary 
calculations for each model and each draw of (w,v) are too arduous). In such 
a case, the sum may be approximated by T~^ to(w, v|M(*))7r(M(*)), where 
t indexes a set of T models. For instance, t could index the set of models vis¬ 
ited during the RJMCMC sampler or a larger set of models could be used (the 
posterior of a model Ma not in this set can be estimated using (3.8)). 

4. Simulation Experiments 

This section considers nine different scenarios where we explore a range of de¬ 
tectability and prevalence regimes to assess the behavior of the proposed algo¬ 
rithm. For each model component, the base model is taken to be the intercept- 
only model, and the full models considered for the presence and the detection 
have, respectively, five and three predictors. Therefore, the model space con¬ 
tains 2® X 2^ = 256 candidate models. The assumed true models are M^z = 
{l,Xi,X 2 ,X 5 } for the presence and Mxy = { 1 , 92793 } for the detection, where 1 
represents the intercept. This small model space is considered so that compar¬ 
isons with selection using AIC (which generally requires complete enumeration 
of the model space) can be made. 

The simulation scenarios we consider vary depending on where the distribu¬ 
tions for the detection and presence probabilities are centered. That is, we set 
the average probability for detection and presence to predefined values p and ijj, 
respectively. If the detection probabilities are centered near one, a non-detection 
commonly implies a non-presence since the detection is almost perfect. On the 
contrary, if the detection probabilities are centered close to zero (as with cryptic 
species), then the uncertainty surrounding an observed zero is greater, making it 
more difficult to determine if this also corresponds to a true zero in the presence. 
Now, combining the different values for p with different values for the center of 
the distribution for the presence probabilities ip, we can account for a variety 
of possibilities observed in real data, ranging from cryptic but highly prevalent 
species, to easy to detect but very rare species. 

The mean probability values for detection and presence that determine our 
scenarios correspond to the pairs {p,ip) G {0.2, 0.5,0.8}^^. To match the target 
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values (p, V'), 15 independent sets of {X.p, Qi?} were drawn independently from 
the standard normal distribution, and for each of them the true model param¬ 
eters were chosen to solve for ct and A the equations = V' p(A) = p, 
where 


= ^^d'(x»and 

^ i^l 

N Ji 

pw = 

Z^i-l i=l 2 = 1 

For each scenario and dataset combination, we used the best solution from ten 
runs of a gradient-based (quasi-Newton) algorithm initialized from independent 
standard normal draws. Finally, having determined the regression parameters 
corresponding to the different scenarios and conditioning on Mtz and Mry, 
the true presence and detection indicators were drawn from the probit model 
described by (2.1) for each dataset. 

The results are shown in Figure 1, which depicts the average proportion of 
true positive (TP) and false positive (FP) predictors included in the selected 
models under each scenario. The TP predictors are those in the true model 
that are also in the selected model, and the FP predictors correspond to those 
absent from the true model but included in the chosen model. The selected 
models are the lowest AIC model and the median probability model (MPM) 
under the objective Bayes methodology. The MPM is the model that includes 
all predictors whose marginal posterior inclusion probability (MPIP) is greater 
than or equal to 0.5, where the MPIP for a given predictor is defined as 

^(predictor is included|y) = E P{^\y:-^^^IpredictorGM} ■ (^-l) 

MeM 

The TP and FP rates for both detection and presence components lead to the 
same conclusions. In terms of the TPs, the AIC selects a slightly higher number 
of true positive terms, especially for the component of the model associated to 
the presence indicators. Nonetheless, these differences are modest at most. Con¬ 
versely, the resulting proportions of false positive terms (FP) tend to be strink- 
ingly lower using our method, especially for the presence component in those 
scenarios where there is poor detection (i.e., p = 0.2). Remarkably, whenever 
the species is highly prevalent {ip = 0.8) and detection ranges between moderate 
and high {p — 0.5, 0.8), the number of false positive terms under our approach is 
very close to zero in both model components. Also, with {p,ip) = (0.2,0.8) our 
method substantially outperforms AIC in filtering out the false positive terms 
both in the presence and detection components. 

These results are very encouraging: the proposed method not only reduces the 
inclusion of false positive terms in comparison to AIC but also has comparable 
performance finding true predictors. 
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(a) Presence 



(b) Detection 

Fig 1. Proportion of true positives (TP) and false positives (FP) using the proposed approach 
and AIC for the detection and the presence components of the model. 


5. Case studies 

In this section, we analyze two datasets. First, we consider presence-absence data 
for mallard wild ducks {Anas platyrhynchos) ^ collected as part of the 2002 Swiss 
breeding bird monitoring program. For our second example, we consider the blue 
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hawker dragonfly data, which had been previously studied using AIC as the 
variable selection strategy in Kery et al. (2010). The mallard data is extremely 
clean, with sufficient sites being surveyed, which for the most part are visited the 
same number of times. On the other hand, the blue hawker dataset was collected 
through a large scale citizen science effort. As such, although the number of 
sites visited is large for this type of data, it displays large asymmetries in the 
surveying effort, posing a more challenging problem for this type of analysis. 

Both data sets contain a sufficiently small number of predictors so that enu¬ 
meration of the entire model space is feasible. Therefore, for these data analy¬ 
ses, we present estimators of posterior probabilities from enumeration (EPE), 
renormalization (RPE), and visit frequency (FPE). While all estimates exhibit 
Monte Carlo error, we treat the enumeration estimators as a gold standard es¬ 
timator because the Monte Carlo error can be easily controlled. We implement 
the method of Chib and Jeliazkov (2001) for estimation of the marginal and use 
a relative magnitude stopping rule to determine the length of sampling (Flegal 
and Gong, 2013). In particular, we require that the 95% confidence interval for 
the estimator for the log posterior evaluated at its mode be less than 1% of the 
size of the estimate. 

To obtain the EPE for each model, we run the MCMC algorithm dehned by 
(2.2) using the priors given in (2.5)-(2.7). These yield draws of the regression 
coefficients conditional on each model, which are then used to calculate the 
marginal density of the response. We calculate the EPEs using the marginals 
obtained under each model. Once the EPEs are in place, we then compare them 
to their corresponding MCMC estimates using either FPE or RPE. Expression 
(3.8) enables direct calculation of the RPEs for a specified set Ma of models, 
which may even include models that were not sampled. Given the moderate size 
of the model space for these examples, in both cases we set Ma to be the entire 
model space. In contrast, as a general rule the EPEs are only available for the 
set of the visited models in the RJMCMC. Finally, to compare our results to the 
traditional approach using AIC, we use the “Akaike weights” (see Burnham and 
Anderson, 2003; Burnham, 2004, for a definition and further information). These 
are obtained using functions occu and dredge from the R packages unmarked and 
MuMin, respectively. The AIC weights allow us to make direct comparison of the 
results provided by either method, as they can be seen as posterior probabilities 
obtained from a specific prior on the model parameters. However, as AIC is 
minimax-rate optimal for estimating the regression function, it cannot be a 
consistent model selector, as demonstrated in Yang (2005), making these priors 
ill-suited for variable selection. 

5.1. The mallard data 

As Switzerland is a small and mountainous country, it provides for large varia¬ 
tion in its topography and physio-geography. As such, elevation is a good candi¬ 
date to predict species occurrence at a large spatial scale. It can serve as a proxy 
for habitat type, intensity of land use, temperature, as well as some other biotic 
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factors (Kery et al., 2010). The data used in the illustration was collected by the 
Swiss breeding bird survey, and had been previously used to derive abundance 
estimates in Kery et al. (2005). 

The monitoring program for common breeding bird species comprises more 
than 250 1-km^ quadrats distributed in a grid sample across Switzerland. Through¬ 
out the breeding season, each quadrat is surveyed two or three times annually 
by an experienced surveyor along a route, recording the date and whether vi¬ 
sual or acoustic contact was made. Elevation (elev) and forest cover (forest) 
were matched for the studied locations from the Swiss Federal Statistical Office 
(Kery and Schmid, 2004). Given that the route length (length) across quadrats 
was not homogenous, route length (within a quadrat) was considered to account 
for variation in effective sample area. To model the detection probabilities, sur¬ 
vey duration divided by route length (ivel) was used as a measure of effort. Also 
the date (date) was considered for the detection component since the surveys 
were collected over a three month period, and behavioral changes that might 
affect detection could be expected. Using the built-in feature of our algorithm 
to account for the polynomial structure in the predictors, we considered a full 
quadratic surface for the predictors, both in the presence as well as in the detec¬ 
tion component. The dataset contains 235 quadrats, of which two were surveyed 
once, 42 twice, and 191 were visited three times. 

5.1.1. Results 

As mentioned above, given that this dataset contains only a few covariates, even 
when considering the full quadratic surfaces, it is possible to perform complete 
enumeration of the model space (which has 1,235 models). The results from our 
analyses are summarized in terms of the MPIPs (calculated using (4.1)), the 
top ranked models (in terms of their posterior probabilities), and the Median 
Probability Model (MPM), which is the model containing only terms whose 
MPIPs are greater than 0.5. These measures were all obtained for each method 
using the posterior probabilities from the joint model for presence and detection. 

Table 1 displays the MPIPs calculated with EPEs, RPEs, FPEs and AICu,. 
Although the MPIPs obtained from EPE are lower than those from the two other 
estimates (RPE and EPE), for the most part all three share the same ordering, 
with the exception of the length^ term in the presence component. It is worth 
noting that, although the MPIPs are comparable for the three alternatives, those 
from RPE are considerably closer to those from EPE than those from RPE, 
especially for the detection component. The MPIPs from AIC^, are considerably 
higher for most predictors than any of their Bayesian counterparts, implying 
that good models resulting from AIC selection are more complex, as expected. 

Using each of the first three columns displayed in Table 1 one can extract a 
median probability model (MPM). Following the same approach, with the last 
column in Table I, we obtain the 50% threshold model using the AIC weights. 
The MPM matches for RPE and EPE, and this model in turn is similar to that 
from EPE, but the latter excludes the forest term in the presence component. 
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Table 1 

MPIPs from joint model for the presence (top) and the detection (bottom) components for 

the mallard dataset 



EPE 

RPE 

EPE 

AIC„ 

elev 

0.9966 

1.0000 

1.0000 

1.0000 

forest 

0.9446 

0.9525 

0.9489 

0.9987 

length 

0.4305 

0.5998 

0.5983 

0.9625 

length*forest 

0.2153 

0.3803 

0.4090 

0.8737 

elev*length 

0.2069 

0.3336 

0.3491 

0.7561 

elev*forest 

0.1297 

0.1448 

0.1732 

0.3577 

elev^ 

0.1110 

0.1293 

0.1620 

0.3347 

forest^ 

0.1067 

0.1229 

0.1504 

0.3077 

length^ 

0.0734 

0.1440 

0.1639 

0.5333 



EPE 

RPE 

EPE 

AIC„ 

date 

0.1315 

0.1982 

0.3846 

0.5573 

ivel 

0.0538 

0.1476 

0.3568 

0.3086 

date^ 

0.0258 

0.0560 

0.1119 

0.3645 

ivel^ 

0.0133 

0.0540 

0.0980 

0.3220 

ivel*date 

0.0012 

0.0250 

0.0645 

0.0527 


In spite of this discrepancy, it is noteworthy that the MPIP using EPE for this 
term is 0.4305, being relatively close to the 0.5 threshold for the MPM. The 
comparable model obtained using AIC weights is considerably larger than all 
the MPMs resulting with EPE, RPE and EPE, all of which are nested within 
it. 


Table 2 

MPMs obtained from MPIPs using EPE, RPE and EPE and pseudo-MPM with AIC 
weights for the mallard dataset 



Detection 

Presence 

EPE 

{1} 

{1, elev, forest} 

RPE 

{1} 

{1, elev, forest, length} 

EPE 

{1} 

{1, elev, forest, length} 

AIC^ 

{1, date} 

{1, elev, forest, length, length*forest, elev*length} 


Finally, Table 3 displays the five highest probability models (HPMs) under 
the three calculation alternatives, as well as those resulting from AIC based 
ranking. Remarkably, the highest probability model is the same under the true 
posterior probabilities and the two estimation methods considered. Among the 
set of top models resulting from EPE, four are among the top five from RPE, 
and three are among those from EPE. Additionally, the model ranked fifth 
using EPE, which does not match with any of the top five HPMs from RPE 
or EPE, is ranked eighth and ninth with RPE and EPE, respectively. Also, 
models ranked fifth under RPE (which coincides with model four with EPE) and 
fifth under EPE, which are not among the top five with EPE, are respectively 
ranked eighth and seventh with EPE. Again, more complex top models result 
from AIC selection in the presence components, and notably the model posterior 
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probabilities are highly diluted across the model space, with the five top models 
concentrating only about 8% of the posterior mass. This contrasts markedly 
with the mass harnessed by the top five models with the other three methods, 
which are approximately 26% with FPE, 43% for RPE and 55% with EPE. 


Table 3 

Top five models with EPE, RPE, EPE and AIC for the mallard dataset 

EPE 



Detection 

Presence 

p{My,M/y) 

1 

{1} 

{l,e lev, forest} 

0.3101 

2 

{1} 

{l.elev, length,forest} 

0.0954 

3 

{1} 

{l.elev, length,forest,elev*length,length*forest} 

0.0634 

4 

{1} 

{l,elev, length,forest,elev*length} 

0.0420 

5 

{1} 

{l,elev,forest, elev*forest} 

0.0373 

RPE 


Detection 

Presence 

p(My,M/y) 

1 

{1} 

{l,elev,forest} 

0.1821 

2 

{1} 

{l,elev, length,forest,elev*length,length*forest} 

0.0933 

3 

{1} 

{l,elev, length,forest,elev*length} 

0.0576 

4 

{1} 

{l,elev, length,forest} 

0.0572 

5 

{1} 

{l,elev, length,forest, length*forest} 

0.0431 

FPE 


Detection 

Presence 

p(My,M/y) 

1 

{1} 

{l,elev,forest} 

0.1063 

2 

{1} 

{l,elev, length,forest,elev*length,length*forest} 

0.0600 

3 

{1} 

{l,elev, length,forest,elev*length} 

0.0354 

4 

{1} 

{l,elev, length,forest, length*forest} 

0.0300 

5 

{l,date} 

{l,elev,forest} 

0.0284 

AIC™ 


Detection 

Presence 

AIC™(Mj,,Mdy) 

1 

{l,date} 

{l,elev,forest, length,elev*length,forest*length} 

0.0192 

2 

{l,date} 

{l,elev,forest, length, length^,elev*length,forest*length} 

0.0190 

3 

{1} 

{l,elev,forest, length, length^,elev*length,forest*length} 

0.0136 

4 

{1} 

{l,elev,forest, length,elev*length,forest*length} 

0.0136 

5 

{l,date^} 

{l,elev,forest, length,elev*length,forest*length} 

0.0121 


The results in Tables 1-3 indicate that estimating the model posterior proba¬ 
bilities using either RPE or FPE yield reasonable approximations to the actual 
posterior probabilities. In particular, all methods rank models similarly, and if 
model averaging was to be performed, these would all produce comparable re¬ 
sults, as the derived MPIPs resemble each other under the three alternatives. 
Nonetheless, following the results from Table 1 we prefer RPEs, as these appear 
to be converging faster towards the benchmark posterior values (EPEs). These 
results are consistent with the findings from exhaustive simulation experiments 
conducted in Taylor-Rodriguez et al. (2015), where overwhelming evidence was 
found in favor of renormalized model posterior estimates when compared to the 
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frequency-based ones in the multiple linear regression problem. For occupancy 
models, this behavior is more conspicuous in the detection component than in 
the presence one, possibly due to the additional uncertainty arising from only 
partially observing the presence indicators. In addition to the observation that 
the renormalized posteriors are closer to those from enumeration, in larger model 
spaces where not all models are visited by the stochastic search, it is possible to 
calculate renormalized posteriors for a larger set of models than those visited, 
while with frequency-based estimates this is not possible. 

5.2. Blue hawker data 

During 1999 and 2000, an intensive volunteer surveying effort coordinated by the 
Centre Suisse de Cartographie de la Faune (CSCF) was conducted to analyze 
the distribution of the blue hawker, Ashna cyanea (Odonata, Aeshnidae), a 
common dragonfly in Switzerland. Repeated visits to 1-ha pixels took place to 
obtain the corresponding detection history. In addition to the survey outcome, 
the X- and y-coordinates, thermal level, the date of the survey, and the elevation 
were recorded. Surveys were restricted to the known flight period of the blue 
hawker, which occurs between May 1 and October 10. In total, 2,572 sites were 
surveyed at least once during the surveying period. The number of surveys per 
site ranges from 1 to 22 times within each survey year, with as many as 67% 
of the sites being surveyed only once, and only 5% of the sites being surveyed 
more than 3 times. As such, the analysis of this data set is an illustration of a 
considerably more challenging problem. 

Kery et al. (2010) summarize the results of this effort using AlC-based model 
comparisons. To select the predictors in the detection component, the authors 
follow a backwards elimination approach while keeping the presence component 
fixed at the most complex model. To select the presence model, they choose 
among a group of three models while using the chosen detection model. The full 
models considered in this study are 

= Ao-I-Aiyear-f A 2 elev-f Aaelev^-I-A 4 elev^-I-Asdate-I-Agdate^ 

= ao-I-ai year-I-Qf 2 elev-I-a 3 elev^-I -04 elev^, 

where the term year denotes 'I-{ye 3 r= 2 om} ■ 

Assuming these full models and intercept only base models (and disregarding 
the polynomial hierarchy among predictors), the model space for this problem 
contains 2®+"*^ = 1, 024 models in the joint model space. However, if the polyno¬ 
mial structure is respected, without considering interactions (for compatibility 
with the analysis in Kery et al. (2010)), the size of the model space for the de¬ 
tection component reduces to 24 models, and to eight models for the presence. 
This corresponds to a total of 192 models in the combined space. In the exer¬ 
cise below, when using the proposed approach we enforce the strong heredity 
condition through the priors over the model space. 

As in the analysis of the Mallard dataset, we obtain the EPEs, the RPEs, and 
the FPEs. The model ranks obtained with the posterior probabilities (or their 
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estimates) are compared to those resulting from AIC selection. The functions 
used to conduct selection with AIC did not constrain the model space to respect 
strong heredity, hence for the AIC selection all 1024 models were considered. 
All results are compared to the models ultimately recommended by Kery et al. 
(2010), given by 

Detection: {l, elev, elev^, date, date^} 

Presence: |l,elev,elev^,elev^} . 


5.2.1. Results 

Table 4 shows the MPMs from either of the approaches considered obtained 
with the MPIPs found in Table 6 of Appendix B. The MPMs obtained with 
RPE and FPE coincide, and are similar to that from EPE, with the the latter 
additionally including the elev^ term. The pseudo-MPM that results when using 
AIC weights contains all the term included in the MPMs from RPE and FPE, 
but adds the elev^ and year terms in the detection component. Note that this 
model does not respect the polynomial hierarchy, including elev^ but not elev^. 

Table 4 

MPMs obtained from MPIPs using EPE, RPE and FPE and pseudo-MPM with AIC 
weights for the blue hawker dataset 



Detection 

Presence 

EPE 

{l,date,date^ ,elev,elev^} 

{l,elev,elev^} 

RPE 

{1,date,date^,elev} 

{l,elev,elev^} 

FPE 

{1,date,date^,elev} 

{l,elev,elev^} 

AIC„ 

{1,date,date^,elev,elev^,year} 

{l,elev,elev^} 


The top ranked models in terms of the true (EPE) and estimated posterior 
probabilities (RPE and FPE), and from AlC-based selection are displayed in 
Table 5. The top model obtained with EPE, RPE and FPE are the same for both 
the presence and detection components, with the top AIC model not respecting 
the polynomial hierarchy in the detection component (including the elev^ but not 
elev^) and having only the year term in the presence component. Interestingly, 
four out of the top five models found by EPE coincide with those from RPE, 
whereas only two from EPE are among the top 5 discovered with FPE, indicating 
again faster convergence of the renormalized estimates when compared to the 
frequency based ones. Again, it is worth emphasizing that the probability mass 
with AIC weight is much more diluted across the model space than with any of 
its Bayesian counterparts. 

To conclude, a notable advantage of the Bayesian approach is that the uncer¬ 
tainty associated to the choice of a particular model can be assessed using the 
model posterior probabilities, whereas this is not the case with AIC selection, 
as the AIC weights do not correspond to actual posterior probabilities. 
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Table 5 

Top ranked models using EPE, RPE, FPE and AIC weights for the blue hawker dataset 



Detection 

Presence 

p{My,M/y) 

EPE 

{l,date,date^,elev} 

{l,elev,elev^} 

0.2090 

RPE 

{l,date,date^,elev} 

{l,elev,elev^} 

0.3725 

FPE 

{l,,date,date^,elev} 

{l,elev,elev^} 

0.1974 

A-ICui 

{1, date, date^,elev,elev^,year} 

{l.year} 

0.0422 


6. Discussion 

This paper developed the first objective Bayes methodology for variable selec¬ 
tion using single-season site occupancy models, based on intrinsic priors derived 
from non-informative priors. This solution uses latent variables to data-augment 
the analysis, helping to seamlessly calculate the model posterior probabilities. 
Working on the latent scale additionally facilitates the construction of a straight¬ 
forward MCMC sampler and posterior estimation using sample averages. 

Because the intrinsic priors are built from non-informative priors, the need for 
hyperparameter specification is avoided, making the method entirely automatic 
and widely applicable. Additionally, the types of prior distributions assumed on 
the model space (HIP, HOP and HUP) enforce the heredity constraints required 
when performing selection with interactions and higher-order polynomial pre¬ 
dictors. These classes also allow for stronger penalization than the usual equal 
probability prior, further helping control the false positive rate. These have been 
shown to be particularly useful in problems with small and moderate sample 
sizes (for more details see Taylor-Rodriguez et ah, 2015). An important advan¬ 
tage of our method, relative to the AlC-based selection, is that the resulting 
model posterior probabilities provide a measure of uncertainty associated with 
choosing a particular model. 

The stochastic search algorithm can be used to thoroughly explore large 
model spaces using the renormalized posterior estimates (instead of the frequency- 
based ones). This tool will allow practitioners to explore the model space with¬ 
out having to enumerate it or preselect a subset of models, enabling its use with 
larger model spaces. 

The simulation experiments confirmed the ability of the method to identify 
the predictors present in the true model when considering both the highest and 
median probability models. The objective Bayes method proved to be compet¬ 
itive with AIC in detecting true predictors, and greatly outperformed AIC in 
reducing the number of false positive predictors included in the models with 
high posterior probabilities. 

The software used throughout the article was built into the R package Oc- 
cOBayes available at request. This package includes functions to run the variable 
selection procedure, as well as some auxiliary functions to validate a set of “best” 
models using a hold-out data set. 
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Appendix A: Model Selection Algorithm 


For each of the two components of the model -presence and detection- the 
algorithm first draws the set of active predictors (i.e., and Ay) together 
with their corresponding parameters. This is a reversible jump step that uses a 
Metropolis Hastings correction with proposal distributions given by 


q{A*y\y,Zo, 


zW,vW,MaJ 

II 

to 1 

^p(Ma.|zo,z^*\v 


1 

2 

^p(MA^|y,Zo,z^*) 


V-i) 


where L{Ma^) and L{MAy) denote the sets of models obtained by adding or 
removing one predictor at a time from the corresponding feasible sets of nodes 
in Ma^ and Ma^ , respectively. Here Zq are the observed presence indicators and 
Zu are those that are unobserved. 

To promote mixing, this step is followed by an additional draw from the full 
conditionals of a and A. The densities p(q:o|.), p(q!a|-)j p(-^o|-)i a'ndp(AA|.) can 
be sampled from exactly via Gibbs steps. Using the notation a|- to denote the 
random variable a conditioned on all other parameters and on the data, these 
densities are given by 

. aol—AA((X^,Xo)-iX'v,(X'Xo)-i), 

• O-aY ~ A/" where the mean vector and the covariance matrix 

are given by = 2 N+pa, and = (S„^X'^v), 

. Ao|-~AA((Q'Qo)-iq;,w,(Q'Qo)-i), and 

• Aa I • ~ A/” ), analogously with mean and covariance matrix given 

= 2j!+pAy (QaQ^)~^ and = (ExaQ'a^)- 

Finally, Gibbs sampling steps are also available for the unobserved occupancy 
indicators z^, and for the corresponding latent variables v and w. The full 
conditional posterior densities for Zu'^^\ and are those described 

in Dorazio and Taylor-Rodriguez (2012) for the single-season probit model. 

The following steps summarize the stochastic search algorithm: 


1 . 

2 . 


Initialize A ^^, , zl°^, , ccq'^"' , Aq 


.(0) \(o) 


v(0),wW, 

Sample the model indices and corresponding parameters: 
(a) Draw simultaneously 


• A* - 9(A2|zo,zL*\ v(*),MaJ, 

• cXq ~p(ao|v^*^), and 

• - p(aA|ArA*,v(*)). 
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(b) Accept oiq, a*^,) with probabil¬ 

ity 

. A |zo,zl‘\v(*)) g(Ai‘^|zo,zl*\ 

dz = mm 1,-^-7-T-7-T-^ , 

y p(M^^t)|zo,zl\v(*)) (7 (A;|zo,z1\v(‘), MaJ J 

otherwise, let 

(c) Sample simultaneously 

• ~ 9(^y|y,Zo,Z«\w(‘), MaJ, 

• -^0 ^P(Ao|w(*)), and 

(d) Accept Aq= (Ma*,Ao, A^,) with probability 


6y = min I 1 


p{Mai |y, Zo, zl‘\ q(,A^P |y, z^, zl‘\ ) 

piM^it) |y, zo, zl*\ wd)) g(Aj|y, z^, zi*\ wW, MaJ 


otherwise, let AqA^'*’^^’^) = {Af, 

3. Sample base model parameters: 

(a) Draw ~ p(Q:o|v(‘y. 

(b) Draw ~ p(Ao|w(*y. 

4. To improve mixing, resample model coefficients not present the base model 
but are in Ma- 

(a) Draw p{aA\M^(t+i) 

(b) Draw ~ p(Aa|M (t+i), w^*)). 

5. Sample latent and missing (unobserved) variables: 

(a) Sample ZA ' ~p(z„|M^(t+i),y,a)^ ,“o ) 

(b) Sample ~ p(v|M^o+i), Zo, zi*’''^\ 

(c) Sample ~ p(w|M .(t+i), Zo, 


Appendix B: Additional tables blue hawker analysis 
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Table 6 

MPIPs with EPE, RPE and FPE and AIC weights, obtained from joint model for presence 
(top) and detection (bottom) components for the blue hawker dataset 



EPE 

RPE 

FPE 

AIC„ 

elev 

0.5346 

0.7971 

0.8338 

0.5538 

elev^ 

0.5228 

0.7885 

0.7956 

0.3542 

elev^ 

0.2041 

0.2923 

0.2988 

0.6626 

year 

0.1130 

0.0676 

0.3421 

0.4125 



EPE 

RPE 

FPE 

AIC„ 

date 

0.9999 

1.0000 

1.0000 

1.0000 

date^ 

0.9999 

0.9737 

0.9632 

1.0000 

elev 

0.9852 

0.9590 

0.9582 

0.9630 

elev^ 

0.5170 

0.2591 

0.2797 

0.4099 

elev^ 

0.2601 

0.0921 

0.1114 

0.5453 

year 

0.2169 

0.0658 

0.2845 

0.5566 


Table 7 

Top five models with RPE, FPE and AIC for the blue hawker dataset 

EPE 



Detection 

Presence 

Post 

1 

{1, elev,date,date^} 

{1, elev.elev^} 

0.2090 

2 

{1, elev,date,elev^,date^,elev^} 

{1} 

0.1763 

3 

{1, elev,date,elev^.date^} 

{1} 

0.1747 

4 

{1, elev,date,date^} 

{1, elev,elev^,elev^} 

0.1200 

5 

{1, year,elev,date,elev^,date^,elev^} 

{1} 

0.0568 

RPE 


Detection 

Presence 

p{My,M^\y) 

1 

{1,elev,date, date^} 

{l,elev,elev^} 

0.3725 

2 

{1,elev,date, date^} 

{l,elev,elev^,elev^} 

0.2058 

3 

{1,elev,date, elev^.date^} 

{1} 

0.1055 

4 

{1,elev,date, elev^,date^,elev^} 

{1} 

0.0656 

5 

{1,elev,date, date^} 

{1,year,elev,elev^} 

0.0308 

FPE 


Detection 

Presence 

p(My,M^\y) 

1 

{1,elev,date, date^} 

{l,elev,elev^} 

0.1974 

2 

{1,elev,date, date^} 

{l,elev,elev^,elev^} 

0.1138 

3 

{1,elev,date, date^} 

{1,year,elev,elev^} 

0.1023 

4 

{1,year,elev,date,date^} 

{l,elev,elev^} 

0.0728 

5 

{1,elev,date, date^} 

{1,year,elev,elev^,elev^} 

0.0599 

AIC™ 


Detection 

Presence 

AIC™(Mj,,Af,|y) 

1 

{1,date, date^, elev,elev^,year} 

{l.year} 

0.0422 

2 

{1,date, date^, elev} 

{l,elev,elev^} 

0.0403 

3 

{1,date, date^, elev,year} 

{l,elev,elev^} 

0.0348 

4 

{1,date, date^, elev,year} 

{1, elev, elev^,year} 

0.0321 

5 

{1,date, date^, elev,elev^,year} 

{1} 

0.0254 
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